{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "In this lab we'll look at:\n",
    "- How to build ROC curves\n",
    "- Use two different evaluation metrics to perform feature ranking\n",
    "- Compare/contrast the results of feature ranking on different evaluation measures\n",
    "- Build models on subsets of the features, using the different methods to select the subset\n",
    "- Compare these different models\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from scipy.stats import entropy\n",
    "import os\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we'll load the dataset and take a quick peak at its columns and size"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#load dataset\n",
    "cwd = os.getcwd()\n",
    "datadir = '/'.join(cwd.split('/')[0:-1]) + '/data/'\n",
    "f = datadir + 'ads_dataset_cut.txt'\n",
    "data = pd.read_csv(f, sep = '\\t')\n",
    "data.columns, data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next step we'll use the Decision Tree classifier's built in feature importance attribute to compute the normalized Mutual Information/Information Gain of each feature. Note a few things about this approach: 1). With extremely high dimensional data, one may want to calculate the normalized MI directly for each feature (the code to do that is a bit more complex so we used the DT instead), 2). The DT is a greedy algorithm, so the feature importance ranks it produces may not be equal to the rank of normalized MI calculated individually."
   ]
  },
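  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an aside on the first caveat, here is a minimal sketch of computing a normalized MI for a single feature directly, without a tree. It assumes the feature is discrete (or already binned) and normalizes by the entropy of the target, which is only one of several possible normalizations. The helper `normalized_mi` and the toy arrays `x_toy` and `y_toy` are hypothetical and not part of the lab data.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from scipy.stats import entropy\n",
    "from sklearn.metrics import mutual_info_score\n",
    "\n",
    "def normalized_mi(x, y):\n",
    "    # mutual information between two discrete arrays, in nats\n",
    "    mi = mutual_info_score(x, y)\n",
    "    # entropy of the target in nats (entropy() normalizes the counts itself)\n",
    "    _, counts = np.unique(y, return_counts=True)\n",
    "    h_y = entropy(counts)\n",
    "    return mi / h_y if h_y > 0 else 0.0\n",
    "\n",
    "# hypothetical toy data: x_toy fully determines y_toy\n",
    "x_toy = np.array([0, 0, 1, 1, 2, 2])\n",
    "y_toy = np.array([0, 0, 1, 1, 1, 1])\n",
    "nmi_perfect = normalized_mi(x_toy, y_toy)\n",
    "\n",
    "# a constant feature carries no information about the target\n",
    "nmi_const = normalized_mi(np.zeros(6, dtype=int), y_toy)\n",
    "```\n",
    "\n",
    "Under this normalization a feature that fully determines the target scores close to 1 and an uninformative feature scores close to 0. Unlike the DT importances, each score here is computed in isolation, so it ignores interactions between features."
   ]
  },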
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#import the decision tree module from sklearn\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "#build a decision tree with max_depth = 20 using entropy\n",
    "Y = data['y_buy']\n",
    "X = data.drop('y_buy', 1)\n",
    "\n",
    "#Student - instantiate the DT\n",
    "dt = \n",
    "#Student - now fit the DT\n",
    "\n",
    "#Student - Now use built in feature importance attribute to get MI of each feature and Y\n",
    "feature_mi = "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll add the feature importances to a dictionary where key-values are: {feature_name:dt_feature_importance}. This can be done in one line using the zip and dict functions."
   ]
  },
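  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the `dict(zip(...))` idiom on made-up values. The names `toy_features` and `toy_importances` are hypothetical and unrelated to the lab dataset.\n",
    "\n",
    "```python\n",
    "# hypothetical feature names and importance scores, in matching order\n",
    "toy_features = ['f1', 'f2', 'f3']\n",
    "toy_importances = [0.2, 0.5, 0.3]\n",
    "\n",
    "# zip pairs the i-th name with the i-th score; dict turns the pairs into a mapping\n",
    "toy_dict = dict(zip(toy_features, toy_importances))\n",
    "print(toy_dict)\n",
    "```\n",
    "\n",
    "The idiom relies on both sequences being in the same order, which holds when the scores come from a model fit on columns in that same order."
   ]
  },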
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Student - Add features and their importances to a dictionary\n",
    "feature_mi_dict = "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are going to compute feature ranks using AUC. We can do this without fitting a model, by just seeing how well the individual feature ranks the positives and negatives."
   ]
  },
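  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea concrete, here is a small sketch that treats a raw feature as if it were a model score and computes its AUC with scikit-learn's `roc_auc_score`. A feature that is negatively related to the label gives an AUC below 0.5, so the sketch reports the better of the two directions. The helper `univariate_auc` and the toy arrays are hypothetical.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "def univariate_auc(feature, truth):\n",
    "    # use the raw feature values as ranking scores\n",
    "    a = roc_auc_score(truth, feature)\n",
    "    # negating the scores maps AUC to 1 - AUC, so keep the better direction\n",
    "    return max(a, 1.0 - a)\n",
    "\n",
    "# hypothetical toy data\n",
    "truth_toy = np.array([0, 0, 1, 1])\n",
    "good_feature = np.array([0.1, 0.4, 0.35, 0.8])  # mostly ranks positives higher\n",
    "inverted_feature = -good_feature                # same information, opposite sign\n",
    "\n",
    "auc_good = univariate_auc(good_feature, truth_toy)\n",
    "auc_inverted = univariate_auc(inverted_feature, truth_toy)\n",
    "```\n",
    "\n",
    "Both calls return the same value, because the sign of a feature does not change how much ranking information it carries."
   ]
  },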
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#define a function to print ROC curves. \n",
    "#It should take in only arrays/lists of predictions and outcomes\n",
    "from sklearn.metrics import roc_curve, auc\n",
    "\n",
    "def plotUnivariateROC(preds, truth, label_string):\n",
    "    '''\n",
    "    preds is an nx1 array of predictions\n",
    "    truth is an nx1 array of truth labels\n",
    "    label_string is text to go into the plotting label\n",
    "    '''\n",
    "    #Student input code here\n",
    "    #1. call the roc_curve function to get the ROC X and Y values\n",
    "    fpr, tpr, thresholds = \n",
    "    #2. Input fpr and tpr into the auc function to get the AUC\n",
    "    roc_auc = \n",
    "    \n",
    "    #we are doing this as a special case because we are sending unfitted predictions\n",
    "    #into the function\n",
    "    if roc_auc < 0.5:\n",
    "        fpr, tpr, thresholds = roc_curve(truth, -1 * preds)\n",
    "        roc_auc = auc(fpr, tpr)\n",
    "\n",
    "    #chooses a random color for plotting\n",
    "    c = (np.random.rand(), np.random.rand(), np.random.rand())\n",
    "\n",
    "    #create a plot and set some options\n",
    "    plt.plot(fpr, tpr, color = c, label = label_string + ' (AUC = %0.3f)' % roc_auc)\n",
    "    \n",
    "\n",
    "    plt.plot([0, 1], [0, 1], 'k--')\n",
    "    plt.xlim([0.0, 1.0])\n",
    "    plt.ylim([0.0, 1.0])\n",
    "    plt.xlabel('FPR')\n",
    "    plt.ylabel('TPR')\n",
    "    plt.title('ROC')\n",
    "    plt.legend(loc=\"lower right\")\n",
    "    \n",
    "    return roc_auc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we'll run each feature through the above function to get its invdividual AUC and also plot on a chart. We add some extra lines of matplotlib code to control the formatting and position of the legend. We also want to add each to a dictionary of the format {feature_name:feature_auc}, similar to what we did above (though not using the same one liner). Take some time to review the chart and think about why different features produce differently shaped curves. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "\n",
    "fig = plt.figure(figsize = (12, 6))\n",
    "ax = plt.subplot(111)\n",
    "\n",
    "#Plot the univariate AUC on the training data. Store the AUC\n",
    "\n",
    "feature_auc_dict = {}\n",
    "for col in data.drop('y_buy',1).columns:\n",
    "    #Student put code here\n",
    "    feature_auc_dict[col] = \n",
    "\n",
    "\n",
    "# Put a legend below current axis\n",
    "box = ax.get_position()\n",
    "ax.set_position([box.x0, box.y0 + box.height * 0.0 , box.width, box.height * 1])\n",
    "ax.legend(loc = 'upper center', bbox_to_anchor = (0.5, -0.15), fancybox = True, \n",
    "              shadow = True, ncol = 4, prop = {'size':10})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we want to add both of the dictionaries created above into a data frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Student - Add auc and mi each to a single dataframe\n",
    "df_auc = \n",
    "df_mi = \n",
    "\n",
    "#Student - Now merge the two on the feature name\n",
    "feat_imp_df = \n",
    "feat_imp_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To put the different metrics on the same scale, we'll use pandas rank() method for each feature."
   ]
  },
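  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of why ranking helps, the sketch below ranks two metrics that live on different scales. The frame `toy_scores` and its values are hypothetical, and `ascending=False` is a choice made here so that rank 1 means the highest score.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# hypothetical scores for three features on two differently scaled metrics\n",
    "toy_scores = pd.DataFrame(\n",
    "    {'auc': [0.62, 0.55, 0.71], 'mi': [0.30, 0.05, 0.10]},\n",
    "    index=['f1', 'f2', 'f3'],\n",
    ")\n",
    "\n",
    "# rank each column separately; ascending=False gives rank 1 to the largest value\n",
    "toy_ranks = toy_scores.rank(ascending=False)\n",
    "print(toy_ranks)\n",
    "```\n",
    "\n",
    "After ranking, both columns take values from 1 to the number of features, so they can be compared directly or plotted against a y=x line."
   ]
  },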
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Student - Now create a df that holds the ranks of auc and mi \n",
    "feat_ranks =\n",
    "\n",
    "#Student - Plot the two ranks\n",
    "\n",
    "#Student - Plot a y=x reference line\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Do the feature importance metrics above completely agree with each other (in terms of rank order)? If not, why do you suppose?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Student - Now create lists of top 5 features for both auc and mi\n",
    "top5_auc = \n",
    "top5_mi = \n",
    "top5_auc, top5_mi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next step is the conclusive step from all the analysis done above. We want to test which method above can be used to produce the best subset of features. What we'll do is use the top 5 features ranked by both AUC and the decision tree feature importance and compare them against each other with different algorithms."
   ]
  },
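  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The comparison in the next cell reports both AUC and log-loss. As a reference point, here is a minimal sketch of binary log-loss (the mean negative log-likelihood, so lower is better) checked against scikit-learn's `log_loss`. The helper `toy_log_loss` and the toy arrays are hypothetical, and clipping the probabilities is just one way to avoid taking the log of zero.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.metrics import log_loss\n",
    "\n",
    "def toy_log_loss(ps, ys, eps=1e-6):\n",
    "    # keep probabilities away from 0 and 1 so the logs stay finite\n",
    "    ps = np.clip(ps, eps, 1 - eps)\n",
    "    # mean negative log-likelihood of labels ys under predicted P(y=1) = ps\n",
    "    return -(ys * np.log(ps) + (1 - ys) * np.log(1 - ps)).mean()\n",
    "\n",
    "# hypothetical labels and predicted probabilities of the positive class\n",
    "ys_toy = np.array([1, 0, 1, 0])\n",
    "ps_toy = np.array([0.9, 0.2, 0.6, 0.4])\n",
    "\n",
    "ll_manual = toy_log_loss(ps_toy, ys_toy)\n",
    "ll_sklearn = log_loss(ys_toy, ps_toy)\n",
    "```\n",
    "\n",
    "The two values should agree closely on inputs like these. Log-loss needs predicted probabilities rather than hard class labels, which is worth keeping in mind when choosing the prediction method of each model."
   ]
  },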
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "'''\n",
    "Now do the following\n",
    "1. Split the data into 80/20 train/test\n",
    "2. For each set of features:\n",
    "- build a decision trees max_depth = 5 \n",
    "- build a logistic regression C = 100\n",
    "- get the auc and log-loss on the test set\n",
    "'''\n",
    "\n",
    "\n",
    "def getLogLoss(Ps, Ys, eps = 10**-6):\n",
    "    return ((Ys == 1) * np.log(Ps + eps) + (Ys == 0) * np.log(1 - Ps + eps)).mean()\n",
    "\n",
    "#Student - Split into train and test randomly without using sklearn package\n",
    "#Note, there are many ways to do this:\n",
    "\n",
    "train_pct = 0.8\n",
    "#1. create an array of n random uniform variables drawn on [0,1] range\n",
    "rand = \n",
    "#2. Convert to boolean where True = random number < train_pct\n",
    "rand_filt = \n",
    "\n",
    "#Student - Use the filter to index data into training and test data sets\n",
    "train = \n",
    "test = \n",
    "\n",
    "\n",
    "fsets = [top5_auc, top5_mi]\n",
    "fset_descr = ['auc', 'mi']\n",
    "mxdepths = [5]\n",
    "Cs = [10**2]\n",
    "\n",
    "\n",
    "#Set up plotting box\n",
    "fig = plt.figure(figsize = (15, 8))\n",
    "ax = plt.subplot(111)\n",
    "\n",
    "\n",
    "\n",
    "for i, fset in enumerate(fsets):\n",
    "    \n",
    "    descr = fset_descr[i]\n",
    "    #set training and testing data\n",
    "    Y_train = train['y_buy']\n",
    "    X_train = train[fset]\n",
    "    Y_test = test['y_buy']\n",
    "    X_test = test[fset]\n",
    "    \n",
    "    \n",
    "    #Student - for all d in mxdepths and C in Cs, build DT's and LR's respectively\n",
    "    # get the predictions on the test set and also get the log-loss, then plot\n",
    "    \n",
    "    #Student - instantiate the class\n",
    "    dt = \n",
    "    #Don't forget to fit the tree\n",
    "    #Now make a prediction\n",
    "    preds_dt = \n",
    "    #Now compute the log-loss\n",
    "    ll_dt = \n",
    "        \n",
    "    plotUnivariateROC(preds_dt, Y_test, '{}:DT:md={}:(LL={})'.format(descr, d, round(ll_dt, 3)))\n",
    "\n",
    "        \n",
    "    #Student - instantiate the class\n",
    "    lr = \n",
    "    #Don't forget to fit the LR\n",
    "    #Now make a prediction\n",
    "    preds_lr = \n",
    "    #Now compute the log-loss\n",
    "    ll_lr = \n",
    "\n",
    "    plotUnivariateROC(preds_lr, Y_test, '{}:LR:C={}:(LL={})'.format(descr, C, round(ll_lr, 3)))\n",
    "\n",
    "    \n",
    "# Put a legend below current axis\n",
    "box = ax.get_position()\n",
    "ax.set_position([box.x0, box.y0 + box.height * 0.0 , box.width, box.height * 1])\n",
    "ax.legend(loc = 'upper center', bbox_to_anchor = (0.5, -0.15), fancybox = True, \n",
    "              shadow = True, ncol = 2, prop = {'size':10})\n"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
